fix(timing): warmup pass before timing loop to amortise torch.compile JIT #70
Merged
Conversation
… JIT
Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost (~887 ms vs ~1 ms steady-state), skewing Phase Timing numbers, especially at low sample counts like PREDECODER_INFERENCE_NUM_SAMPLES=1. The warmup only runs when torch.compile is active and TRT is not in use.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracts the warmup block into a named helper so it can be tested in isolation. Five tests cover: fires when compile is active (CPU), skipped when compile is off, skipped when TRT context is present, CUDA sync called on GPU device, CUDA sync not called on CPU device. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bmhowe23 approved these changes on Apr 22, 2026
Summary
- Adds a warmup forward pass through `pipeline_module` before the timing loop in `run_inference_and_decode_pre_decoder_memory`
- The warmup triggers `torch.compile` lazy compilation so the JIT cost does not inflate the first-batch timing measurement
- It only runs when `trt_context is None` and `_applied_compile` is set (torch-only path with compile enabled)
- Extracts the logic into a `_maybe_warmup_compile` helper with 5 unit tests (a sketch of the helper follows the Motivation section below)

Motivation
Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost (~887 ms versus ~1 ms at steady state), skewing Phase Timing numbers, especially at low sample counts such as PREDECODER_INFERENCE_NUM_SAMPLES=1. With large sample counts the JIT cost gets amortised naturally, but at small counts it dominates and makes the Phase Timing numbers misleading. Proposed by Igor Almeida Baratta; approved by Ben Howe.
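For orientation, here is a minimal sketch of what such a helper and its call site could look like. The attribute names (`_applied_compile`, `trt_context`), the function signature, the batch handling, and the timing loop are assumptions for illustration, not the repository's actual code.

```python
import time

import torch


def _maybe_warmup_compile(pipeline_module, sample_batch, device):
    """Run one untimed forward pass so the torch.compile JIT cost is paid
    before the timing loop starts.

    Skipped when torch.compile was not applied, or when a TensorRT context
    is in use (TRT engines are built ahead of time, so there is no lazy
    compilation to trigger). Returns True if a warmup pass actually ran.
    """
    # Attribute names are illustrative; the real module may expose these
    # flags differently.
    if getattr(pipeline_module, "trt_context", None) is not None:
        return False
    if not getattr(pipeline_module, "_applied_compile", False):
        return False

    with torch.no_grad():
        pipeline_module(sample_batch.to(device))

    # Compilation and the warmup kernels run asynchronously on GPU;
    # synchronize so none of that work bleeds into the first timed batch.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return True


def run_inference_and_decode_pre_decoder_memory(pipeline_module, batches, device):
    # Illustrative call site: warm up once, then time every batch normally.
    _maybe_warmup_compile(pipeline_module, batches[0], device)

    timings = []
    with torch.no_grad():
        for batch in batches:
            start = time.perf_counter()
            pipeline_module(batch.to(device))
            if device.type == "cuda":
                torch.cuda.synchronize(device)
            timings.append(time.perf_counter() - start)
    return timings
```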
Test plan
- Unit tests pass (`test_inference_latency_timing.py`, `test_tensorrt_fallback.py`)
- Run with `PREDECODER_INFERENCE_NUM_SAMPLES=1` and confirm the first-batch model-forward time matches steady state

🤖 Generated with Claude Code
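As a rough illustration of two of the five unit tests described above, written against the helper sketched earlier: the test names, the stand-in module, and the import path are assumptions rather than the tests actually added in this PR.

```python
import torch

# Assumes the _maybe_warmup_compile sketch from the previous block is
# importable; adjust the import path to wherever the helper actually lives.
# from predecoder_timing import _maybe_warmup_compile


class _TinyPipeline(torch.nn.Module):
    """Stand-in pipeline module that just counts forward calls."""

    def __init__(self, applied_compile, trt_context=None):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self._applied_compile = applied_compile
        self.trt_context = trt_context
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return self.linear(x)


def test_warmup_fires_when_compile_active_on_cpu():
    module = _TinyPipeline(applied_compile=True)
    ran = _maybe_warmup_compile(module, torch.zeros(1, 4), torch.device("cpu"))
    assert ran and module.calls == 1


def test_warmup_skipped_when_trt_context_present():
    module = _TinyPipeline(applied_compile=True, trt_context=object())
    ran = _maybe_warmup_compile(module, torch.zeros(1, 4), torch.device("cpu"))
    assert not ran and module.calls == 0
```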